18 research outputs found

    Optimization techniques for fine-grained communication in PGAS environments

    Partitioned Global Address Space (PGAS) languages promise to deliver improved programmer productivity and good performance on large-scale parallel machines. However, achieving adequate performance for applications that rely on fine-grained communication, without compromising their programmability, is difficult. Manual or compiler-assisted code optimization is required to avoid fine-grained accesses. The downside of manually applying code transformations is increased program complexity, which hinders programmer productivity. On the other hand, compiler optimization of fine-grained accesses requires knowledge of the physical data mapping and the use of parallel loop constructs. This thesis presents optimizations that address the three main challenges of fine-grained communication: (i) low network communication efficiency; (ii) a large number of runtime calls; and (iii) the creation of network hotspots due to the non-uniform distribution of network communication. To solve these problems, the dissertation presents three approaches. First, it presents an improved inspector-executor transformation that increases network efficiency through runtime aggregation. Second, it presents incremental optimizations to the inspector-executor loop transformation that automatically remove runtime calls. Finally, it presents a loop scheduling transformation that avoids network hotspots and the oversubscription of nodes. In contrast to previous work that uses static coalescing, prefetching, limited privatization, and caching, the solutions presented in this thesis cover all aspects of fine-grained communication, including reducing the number of calls generated by the compiler and minimizing the overhead of the inspector-executor optimization. A performance evaluation with various microbenchmarks and benchmarks, aimed at predicting scaling and absolute performance on a Power 775 machine, indicates that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while for applications with irregular accesses the transformations are expected to yield speedups from 1.12x up to 6.3x. The loop scheduling transformation shows performance gains of 3-25% for the NAS FT and bucket-sort benchmarks, and up to a 3.4x speedup for the microbenchmarks.
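
    The inspector-executor transformation described above can be sketched in plain UPC. The sketch below is our own illustration, not the thesis' compiler-generated code: the array size, the access pattern, and the owner-grouped caching scheme are assumptions, and a static thread count is assumed so the shared array can be declared with a fixed size. The inspector records which threads own the elements the loop needs, the aggregation step fetches each touched owner's chunk with a single upc_memget, and the executor re-runs the loop body against the local copy.

        #include <upc.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define N 1024               /* shared array size (assumed)    */
        #define M 256                /* accesses issued by each thread */

        shared int A[N];             /* default cyclic layout: A[i] has affinity
                                        to thread i % THREADS */

        int main(void) {
            int idx[M];
            int per = (N + THREADS - 1) / THREADS;   /* max elements per owner */
            int *cache  = malloc((size_t)THREADS * per * sizeof(int));
            int *needed = calloc(THREADS, sizeof(int));
            int i, t, sum = 0;

            upc_forall (i = 0; i < N; i++; &A[i])    /* each thread fills its part */
                A[i] = 2 * i;
            upc_barrier;

            for (i = 0; i < M; i++)                  /* some irregular pattern */
                idx[i] = (i * i + MYTHREAD) % N;

            /* Inspector: no remote data is touched yet; we only record
               which owner threads the executor loop will need. */
            for (i = 0; i < M; i++)
                needed[idx[i] % THREADS] = 1;

            /* Runtime aggregation: one bulk transfer per touched owner replaces
               up to M fine-grained remote reads.  upc_memget copies consecutive
               bytes from the owner's local chunk, i.e. A[t], A[t+THREADS], ... */
            for (t = 0; t < THREADS; t++)
                if (needed[t]) {
                    int cnt = N / THREADS + (t < N % THREADS); /* elements owned by t */
                    upc_memget(cache + t * per, &A[t], cnt * sizeof(int));
                }

            /* Executor: the original loop body, now served from the local cache;
               element A[g] sits at cache[(g % THREADS) * per + g / THREADS]. */
            for (i = 0; i < M; i++)
                sum += cache[(idx[i] % THREADS) * per + idx[i] / THREADS];

            printf("thread %d: sum = %d\n", MYTHREAD, sum);
            free(cache); free(needed);
            return 0;
        }

    The coalescing pays off when many accesses share an owner; a production runtime would also deduplicate indices and bound the cache size rather than fetch whole owner chunks as this sketch does.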

    Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

    Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation, however, results in excessive instrumentation that hinders performance.

    This paper addresses this issue and introduces several techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the use of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that are propagated through the loop body.

    A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs up to 1.8x over their hand-optimized UPC counterparts for applications with regular accesses, and up to 6.3x for applications with irregular accesses.
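
    Two of the techniques named above lend themselves to a short sketch. The fragment below is a minimal illustration under our own naming and layout assumptions, not the paper's implementation: the first function shows an inlined locality check that replaces a runtime call with a plain load when the element is provably local, and the second shows the loop code motion that privatizes a shared scalar so it is fetched once instead of once per iteration.

        #include <upc.h>

        shared [4] double V[4 * THREADS];   /* blocked layout: V[i] lives on
                                               thread (i/4) % THREADS */
        shared double alpha;                /* shared scalar read inside hot loops */

        /* Inlined locality check: when the element is local, a cast to a private
           pointer turns the access into a plain load instead of a runtime call. */
        double read_elem(int i) {
            if ((int)upc_threadof(&V[i]) == MYTHREAD)
                return *(double *)&V[i];    /* local: direct load, no instrumentation */
            return V[i];                    /* remote: let the runtime handle it */
        }

        /* Loop code motion / scalar privatization: hoisting the shared scalar
           out of the loop turns n remote reads into one. */
        double scale_sum(const double *x, int n) {
            double a = alpha;               /* privatized copy, fetched once */
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += a * x[i];
            return s;
        }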

    MECCA - KPP Fortran to CUDA source-to-source pre-processor - Alpha Version

    The MECCA-KPP parser is written in the Python programming language and generates CUDA-compatible solvers by parsing the auto-generated Fortran code that the KPP preprocessor outputs. The user runs the parser from the messy/util directory to transform the code: it modifies the messy/smcl/messy_mecca_kpp.f90 file and places in it a single call into the CUDA source file that contains the accelerated code (messy/smcl/messy_mecca_kpp_acc.cu), together with a wrapper function for launching the parallel kernels and copying data to and from the GPU.

    Design and evaluation of a parallel H.264 video compression application with computation partitioning for the Cell processor

    Modern multi-core processors with explicitly managed local memories, such as the Cell Broadband Engine (Cell), constitute in many ways a significant departure from traditional high-performance CPU designs. Such processors, on the one hand, bear the potential of higher performance in certain application domains and, on the other hand, require extensive application modifications. We design and implement c264, a complete H.264 video encoder for the Cell processor, based on the open-source x264 library. Our implementation achieves a speedup of 4.5x on six synergistic processing elements (SPEs), compared to the serial version running on the power processing element (PPE). Our work considers all parts of the encoding process and reveals the related limitations. c264 constitutes an extensive redesign of the original x264 code: it employs fine-grained partitioning of computations among tasks to cope with the small size of the local memory of the SPEs, and it replicates and privatizes shared data structures to accommodate the non-coherent memory of the Cell processor. Our analysis allows us to identify the main limitations for further scaling H.264 video encoding on future many-core processors: (a) task management overheads place a heavy burden on the single master processor, (b) complex control flow in the code limits the degree of available parallelism, and (c) small on-chip memories limit the overlap of communication and computation.
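
    The third limitation, overlapping communication with computation out of a small local store, is usually attacked with double buffering. The fragment below is a generic SPE-side Cell idiom, not c264's actual code; the block size, the function names, and the byte-checksum stand-in for real encoding work are assumptions. While block i is processed, the DMA for block i+1 is already in flight.

        /* SPE-side double buffering; compile with spu-gcc. */
        #include <spu_mfcio.h>

        #define BLK 16384                            /* bytes per block (assumed) */
        static unsigned char buf[2][BLK] __attribute__((aligned(128)));
        static unsigned long checksum;               /* stand-in for encoding work */

        static void process(const unsigned char *p, int n) {
            for (int i = 0; i < n; i++)              /* placeholder computation */
                checksum += p[i];
        }

        /* Stream nblocks blocks from main memory at effective address ea,
           overlapping each DMA with the processing of the previous block. */
        static void consume(unsigned long long ea, int nblocks) {
            int cur = 0;
            mfc_get(buf[cur], ea, BLK, cur, 0, 0);   /* kick off the first transfer */
            for (int i = 0; i < nblocks; i++) {
                int nxt = cur ^ 1;
                if (i + 1 < nblocks)                 /* prefetch block i+1 */
                    mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * BLK,
                            BLK, nxt, 0, 0);
                mfc_write_tag_mask(1u << cur);       /* wait only for the current tag */
                mfc_read_tag_status_all();
                process(buf[cur], BLK);              /* compute while the next DMA flies */
                cur = nxt;
            }
        }

        int main(unsigned long long speid, unsigned long long argp,
                 unsigned long long envp) {
            (void)speid; (void)envp;
            consume(argp, 16);                       /* PPE passes the buffer address in argp */
            return (int)(checksum & 0x7fffffff);
        }

    With only two BLK-sized buffers resident, the local store footprint stays constant regardless of stream length, which is the point of the idiom on a 256 KB local store.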

    Medina: KPP Fortran to CUDA source-to-source pre-processor

    MIT Licence. KPP Fortran to CUDA source-to-source pre-processor: each CPU process that offloads to the GPU requires a chunk of the GPU VRAM, whose size depends on the number of species and reaction constants in the MECCA mechanism. The number of GPUs per node and the VRAM available on each GPU dictate the total number of CPU cores that can run simultaneously.
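
    A worked example makes the constraint concrete. Every number below is an assumption chosen only to illustrate the arithmetic; none is a Medina measurement.

        #include <stdio.h>

        int main(void) {
            double vram_per_gpu_mb = 16384.0; /* VRAM on each GPU (16 GB, assumed)     */
            double per_process_mb  = 1500.0;  /* VRAM one offloading process needs;
                                                 grows with species and reaction
                                                 constants in the mechanism (assumed) */
            int gpus_per_node      = 4;       /* assumed node configuration            */

            int procs_per_gpu  = (int)(vram_per_gpu_mb / per_process_mb);
            int cores_per_node = gpus_per_node * procs_per_gpu;

            printf("offloading processes per GPU: %d\n", procs_per_gpu);
            printf("CPU cores usable per node:    %d\n", cores_per_node);
            return 0;
        }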

    Automatic communication coalescing for irregular computations in UPC language

    Partitioned Global Address Space (PGAS) languages appeared in order to address programmer productivity on large-scale parallel machines. However, fine-grained accesses to shared structures have been identified as one of the main bottlenecks of PGAS languages. Manual or compiler-assisted code optimization is required to avoid fine-grained accesses. The downside of manually applying code transformations is increased program complexity, which hinders programmer productivity. On the other hand, compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for prefetching and coalescing shared accesses at runtime. Larger messages decrease the impact of remote access latency and increase the efficiency of network communication. We have implemented our optimization for the Unified Parallel C (UPC) language. An experimental evaluation in a distributed-memory environment using a Power7 cluster demonstrates the benefits of our optimization.
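
    The trade-off the paper exploits can be reproduced with a toy UPC microbenchmark. The sketch below is our own illustration, not the paper's evaluation: thread 0 reads a block owned by thread 1 first element by element, one message per read, and then with a single coalesced upc_memget.

        #include <upc.h>
        #include <stdio.h>
        #include <sys/time.h>

        #define N 65536                   /* ints per thread block */
        shared [N] int D[N * THREADS];    /* one contiguous N-int block per thread */

        static double seconds(void) {
            struct timeval tv;
            gettimeofday(&tv, 0);
            return tv.tv_sec + 1e-6 * tv.tv_usec;
        }

        int main(void) {
            static int local[N];
            int i;
            upc_forall (i = 0; i < N * THREADS; i++; &D[i])
                D[i] = i;
            upc_barrier;
            if (MYTHREAD == 0 && THREADS > 1) {
                double t0 = seconds();
                for (i = 0; i < N; i++)   /* fine-grained: one remote read each */
                    local[i] = D[N + i];  /* D[N+i] lives on thread 1 */
                double t1 = seconds();
                upc_memget(local, &D[N], N * sizeof(int)); /* coalesced: one bulk get */
                double t2 = seconds();
                printf("fine-grained: %.4f s   coalesced: %.4f s\n",
                       t1 - t0, t2 - t1);
            }
            upc_barrier;
            return 0;
        }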

    Combining Static and Dynamic Data Coalescing in Unified Parallel C


    Reducing compiler-inserted instrumentation in unified-parallel-C code generation

    Programs written in Partitioned Global Address Space (PGAS) languages can access any location of the entire address space via standard read/write operations. However, the compiler has to create the communication mechanisms, and the runtime system has to use synchronization primitives, to ensure the correct execution of the programs. PGAS programs may also contain fine-grained shared accesses that lead to performance degradation. One solution is to use the inspector-executor technique to determine which accesses are indeed remote and which may be coalesced into larger remote access operations. A straightforward implementation of the inspector-executor in a PGAS system, however, may result in excessive instrumentation that hinders performance. This paper introduces a shared-data localization transformation based on Linear Memory Access Descriptors (LMADs) that reduces the amount of instrumentation introduced by the compiler into programs written in the UPC language, and describes a prototype implementation of the proposed transformation. A performance evaluation, using up to 2048 cores of a POWER 775 supercomputer, allows for the prediction that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while applications with irregular accesses yield speedups from 1.12x up to 6.3x.
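
    The payoff of descriptor-based localization can be sketched directly. The fragment below is our illustration, not the paper's prototype: with a constant-stride access pattern, the locality of the whole accessed region can be decided once at loop entry, so one inlined test replaces one check per iteration. The cyclic layout, the function names, the in-bounds assumption, and the static thread count are ours.

        #include <upc.h>

        #define N 4096
        shared double X[N];        /* cyclic layout: X[i] lives on thread i % THREADS */

        /* Straightforward code generation: every dereference may call into the
           runtime, i.e. up to n instrumentation points for this loop. */
        double sum_naive(int lo, int n, int stride) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += X[lo + i * stride];            /* per-element runtime check */
            return s;
        }

        /* Descriptor-guided version: the accessed set {lo + i*stride} is a
           constant-stride descriptor.  If stride is a multiple of THREADS and
           X[lo] is local, every element is local: one test replaces n tests.
           (Assumes lo + (n-1)*stride < N.) */
        double sum_localized(int lo, int n, int stride) {
            double s = 0.0;
            if (stride % THREADS == 0 && (int)upc_threadof(&X[lo]) == MYTHREAD) {
                double *base = (double *)&X[lo];    /* private pointer: element is local */
                int lstep = stride / THREADS;       /* local-offset stride */
                for (int i = 0; i < n; i++)
                    s += base[i * lstep];           /* plain loads, no instrumentation */
            } else {
                for (int i = 0; i < n; i++)
                    s += X[lo + i * stride];        /* fall back to the generic path */
            }
            return s;
        }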